Machine Learning, Copulas and Synthetic Data

Synthetic data is one of those ideas that sounds like it shouldn’t work — if the data is fake, how can a model trained on it generalize to real data? The answer is that “fake” here doesn’t mean arbitrary. Synthetic data generated properly preserves the statistical structure of the original: the marginal distributions of each variable, and the dependence structure between them. If you get that right, the downstream model often can’t tell the difference.

Why bother? Two reasons come up constantly: small samples and privacy constraints. When you don’t have enough data to train a reliable model, generating additional synthetic observations from the learned distribution can help. When your data contains sensitive information that limits how freely it can be shared or used, synthetic data gives you a safe stand-in.

Copulas are the right tool for this. A copula separates two things that are usually bundled together in a multivariate distribution: the marginal distribution of each individual variable, and the dependence structure among them. This separation matters because real data is rarely multivariate normal — the marginals may follow different parametric families, and the correlations in the tails may differ from correlations in the center. Copulas handle both.
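
The mechanics can be sketched in a few lines with plain numpy and scipy (an illustration of the idea, separate from the copulas library used later): sample correlated normals, map them to uniforms with the normal CDF, then push the uniforms through whatever inverse CDFs you want as marginals. The specific marginals below (a gamma and a normal) are arbitrary choices for the sketch.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
rho = 0.8
cov = [[1.0, rho], [rho, 1.0]]

# Step 1: correlated standard normals -> correlated uniforms via the normal CDF
z = rng.multivariate_normal([0, 0], cov, size=10_000)
u = stats.norm.cdf(z)                          # each column ~ Uniform(0, 1)

# Step 2: push the uniforms through arbitrary inverse CDFs (the marginals)
x = stats.gamma(a=2.0).ppf(u[:, 0])            # right-skewed marginal
y = stats.norm(loc=15, scale=3).ppf(u[:, 1])   # symmetric marginal

# The dependence (rank correlation) survives the marginal transforms
tau, _ = stats.kendalltau(x, y)
```

The key point: the dependence was fixed in step 1 and the marginals in step 2, completely independently of each other.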

To give a concrete example, let’s use a few variables from the classic Auto MPG dataset, which ships with seaborn.

import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from copulas.multivariate import GaussianMultivariate
from copulas.univariate import ParametricType, Univariate

df = sns.load_dataset("mpg")
df = df.drop(columns=['origin', 'name'])
df = df.dropna()
df.columns

df = df[['horsepower', 'weight', 'acceleration', 'mpg']]
df.describe()
horsepower weight acceleration mpg
count 392.000000 392.000000 392.000000 392.000000
mean 104.469388 2977.584184 15.541327 23.445918
std 38.491160 849.402560 2.758864 7.805007
min 46.000000 1613.000000 8.000000 9.000000
25% 75.000000 2225.250000 13.775000 17.000000
50% 93.500000 2803.500000 15.500000 22.750000
75% 126.000000 3614.750000 17.025000 29.000000
max 230.000000 5140.000000 24.800000 46.600000

Let’s look at the distribution of each variable, the pairwise scatter plots, and the correlations.

def corrdot(*args, **kwargs):
    # Draw a single dot whose size and color encode the Pearson correlation,
    # with the coefficient printed on top of it.
    corr_r = args[0].corr(args[1], method='pearson')
    corr_text = f"{corr_r:2.2f}".replace("0.", ".")
    ax = plt.gca()
    ax.set_axis_off()
    marker_size = abs(corr_r) * 1000
    ax.scatter([.5], [.5], marker_size, [corr_r], alpha=0.6, cmap="coolwarm",
               vmin=-1, vmax=1, transform=ax.transAxes)
    font_size = abs(corr_r) * 10 + 5
    ax.annotate(corr_text, [.5, .5], xycoords="axes fraction",
                ha='center', va='center', fontsize=font_size)

sns.set(style='white', font_scale=1)
g = sns.PairGrid(df[['horsepower', 'weight','acceleration']], aspect=1, diag_sharey=False)
g.map_lower(sns.regplot, lowess=True, ci=None, line_kws={'color': 'black'})
g.map_diag(sns.kdeplot, color='black')
g.map_upper(corrdot)
plt.show()

A few things stand out. Some pairs of variables are strongly correlated. The distributions differ in shape: acceleration is close to normal, but horsepower and weight have heavier right tails. A Gaussian copula with normal marginals (i.e., a multivariate normal) would miss that — which is why we want the copula to estimate the marginals separately.
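
As a rough numeric check of that claim, Pearson’s second skewness coefficient, 3 × (mean − median) / std, can be computed directly from the describe() table above (values copied from the table; the median is the 50% row):

```python
# Back-of-envelope skewness from the printed summary statistics:
#                (mean,        median, std)
stats_table = {
    "horsepower":   (104.469388,   93.5, 38.491160),
    "weight":       (2977.584184, 2803.5, 849.402560),
    "acceleration": (15.541327,    15.5,  2.758864),
}
skewness = {
    name: 3 * (mean - median) / std
    for name, (mean, median, std) in stats_table.items()
}
```

Horsepower comes out strongly right-skewed (≈0.86), weight moderately (≈0.62), and acceleration nearly symmetric (≈0.04), consistent with the KDE shapes on the diagonal of the pair plot.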

Before generating synthetic data, let’s fit a simple linear regression on the original data as a baseline.

# pop() removes mpg from df in place, so X keeps only the three predictors
y = df.pop('mpg')
X = df

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

print(model.score(X_test, y_test))
0.6501833421053667

Now we generate a synthetic training set using a Gaussian copula, with the marginals estimated separately as parametric distributions (not KDE). The copula model selects the best-fitting parametric family for each variable, then learns the correlation structure among them.

# Select the best PARAMETRIC univariate (no KDE)
univariate = Univariate(parametric=ParametricType.PARAMETRIC)


def create_synthetic(X, y):
    """
    This function combines X and y into a single dataset D, models it
    using a Gaussian copula, and generates a synthetic dataset S. It
    returns the new, synthetic versions of X and y.
    """
    dataset = np.concatenate([X, np.expand_dims(y, 1)], axis=1)

    distribs = GaussianMultivariate(distribution=univariate)
    distribs.fit(dataset)

    synthetic = distribs.sample(len(dataset))

    X = synthetic.values[:, :-1]
    y = synthetic.values[:, -1]

    return X, y, distribs

X_synthetic, y_synthetic, dist = create_synthetic(X_train, y_train)

Let’s inspect what the copula model learned — the fitted distribution for each variable:

parameters = dist.to_dict()
parameters['univariates']
[{'a': np.float64(2.505456580509649),
  'loc': np.float64(44.28600428264269),
  'scale': np.float64(24.186079118156652),
  'type': 'copulas.univariate.gamma.GammaUnivariate'},
 {'loc': np.float64(1604.636580448248),
  'scale': np.float64(3779.05677499984),
  'a': np.float64(1.470861609942495),
  'b': np.float64(2.520267107805836),
  'type': 'copulas.univariate.beta.BetaUnivariate'},
 {'loc': np.float64(1.0632403337723968),
  'scale': np.float64(73.46060125357005),
  'a': np.float64(20.470245095113114),
  'b': np.float64(83.54399672283503),
  'type': 'copulas.univariate.beta.BetaUnivariate'},
 {'loc': np.float64(9.84592394053172),
  'scale': np.float64(40.487634662917245),
  'a': np.float64(1.6832256330594189),
  'b': np.float64(3.231729039577079),
  'type': 'copulas.univariate.beta.BetaUnivariate'}]
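
The first entry is the horsepower marginal: a gamma distribution. As a sanity check, we can rebuild it in scipy from the printed parameters and compare its implied mean and standard deviation to the describe() table (mean ≈ 104.5, std ≈ 38.5; the copula was fitted on the training split, so a small gap is expected):

```python
from scipy import stats

# Gamma parameters copied from the fitted-marginal output above
hp = stats.gamma(a=2.505456580509649,
                 loc=44.28600428264269,
                 scale=24.186079118156652)

fitted_mean = hp.mean()   # loc + a * scale
fitted_std = hp.std()     # sqrt(a) * scale
```

Both land within a fraction of a unit of the data’s summary statistics, which is a quick way to confirm the marginal fit is sensible before trusting the samples.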

And the learned correlation matrix that defines the joint structure:

parameters['correlation']
[[1.0, 0.8473079530880548, -0.720090874761445, -0.8315209336313288],
 [0.8473079530880548, 1.0, -0.42047354869738274, -0.8311150310370441],
 [-0.720090874761445, -0.42047354869738274, 1.0, 0.4357973452281482],
 [-0.8315209336313288, -0.8311150310370441, 0.4357973452281482, 1.0]]
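
Note that this matrix lives on the Gaussian (normal-score) scale, not the raw-data scale. For a Gaussian copula it converts to Kendall’s tau via τ = (2/π)·arcsin(ρ), which is handy when comparing against rank statistics of the original data (matrix values copied from the output above):

```python
import numpy as np

# Correlation matrix learned by the Gaussian copula (from the output above)
corr = np.array([
    [ 1.0,         0.84730795, -0.72009087, -0.83152093],
    [ 0.84730795,  1.0,        -0.42047355, -0.83111503],
    [-0.72009087, -0.42047355,  1.0,         0.43579735],
    [-0.83152093, -0.83111503,  0.43579735,  1.0],
])

# Implied Kendall's tau for a Gaussian copula: tau = (2 / pi) * arcsin(rho)
tau = (2 / np.pi) * np.arcsin(corr)
```

For example, the horsepower–weight entry of 0.847 corresponds to a Kendall’s tau of roughly 0.64.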

Now compare the synthetic data against the original — same summary statistics, same pair plots.

syntDF = pd.DataFrame(
    np.concatenate([X_synthetic, np.expand_dims(y_synthetic, 1)], axis=1),
    columns=['horsepower', 'weight', 'acceleration', 'mpg'])

syntDF.describe()
horsepower weight acceleration mpg
count 274.000000 274.000000 274.000000 274.000000
mean 106.741787 3001.165321 15.282710 24.006457
std 38.396280 834.124368 2.897945 8.008243
min 50.062733 1626.735130 9.532103 10.022592
25% 78.610155 2274.718738 13.244567 17.709268
50% 97.469412 2887.631374 14.938047 23.348269
75% 127.881791 3659.378723 17.208566 28.896533
max 261.357416 5142.185723 26.928153 44.491918
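
The 274 rows match the 70% training split the copula was fitted on. To put a number on “close”, we can compute the relative differences in means between the two describe() tables (values copied from the tables above):

```python
# Means from the original describe() table (392 rows)
orig_mean = {"horsepower": 104.469388, "weight": 2977.584184,
             "acceleration": 15.541327, "mpg": 23.445918}
# Means from the synthetic describe() table (274 rows)
synth_mean = {"horsepower": 106.741787, "weight": 3001.165321,
              "acceleration": 15.282710, "mpg": 24.006457}

rel_diff = {k: abs(synth_mean[k] - orig_mean[k]) / orig_mean[k]
            for k in orig_mean}
```

All four means agree to within about 2.5%.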

The descriptive statistics are close. The pair plot shows the shapes are similar but not identical — the tails in particular may differ somewhat, which is expected given the Gaussian copula assumption on the dependence structure. If the tail behavior mattered critically here, a Clayton or Gumbel copula would be more appropriate.
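
The Clayton point can be made concrete with a small simulation (an illustration, independent of the fit above). The Gaussian copula has zero tail dependence, so the joint tail probability P(U < q, V < q)/q shrinks as q → 0, while a Clayton copula with parameter θ retains lower-tail dependence λ_L = 2^(−1/θ). Here both copulas are calibrated to the same Kendall’s τ = 0.5 (ρ = sin(πτ/2) ≈ 0.707 for Gaussian, θ = 2τ/(1 − τ) = 2 for Clayton):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, q = 200_000, 0.01

# Gaussian copula sample: correlated normals mapped to uniforms
rho = np.sin(np.pi * 0.5 / 2)
z = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n)
gu, gv = stats.norm.cdf(z).T

# Clayton copula sample via its conditional-inverse method (theta = 2)
theta = 2.0
u = rng.uniform(size=n)
w = rng.uniform(size=n)
v = (u ** -theta * (w ** (-theta / (1 + theta)) - 1) + 1) ** (-1 / theta)

# Empirical joint lower-tail probability, normalised by q
gauss_ratio = np.mean((gu < q) & (gv < q)) / q
clayton_ratio = np.mean((u < q) & (v < q)) / q
```

At q = 0.01 the Clayton ratio stays near λ_L = 2^(−1/2) ≈ 0.71 while the Gaussian ratio is far smaller, and the gap widens as q shrinks — exactly the behavior that matters when joint extremes drive the application.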

Finally, re-fit the same linear regression on the synthetic training data and evaluate on the original test set:

model = LinearRegression()
model.fit(X_synthetic, y_synthetic)

print(model.score(X_test, y_test))
0.6343060805430143

The R² is very close to the one from the original data. Not identical — there’s some information loss in the synthetic generation process — but close enough to demonstrate the point: the copula has preserved the statistical structure that the regression model actually needs to make good predictions.

This matters in practice. Synthetic data generated this way isn’t just noise with the right summary statistics; it captures the correlations and distributional shapes that determine model behavior. For augmentation, privacy-preserving data sharing, or testing model robustness, that’s exactly what you need.